AITopics | language tag

language tag

Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies

Mena, Carlos, Serra, Pol, Romero, Jacobo, Messaoudi, Abir, Giraldo, Jose, Armentano-Oller, Carme, Zevallos, Rodolfo, Meza, Ivan, Hernando, Javier

arXiv.org Artificial Intelligence, Jul-21-2025

The lack of dedicated code-switching (CS) datasets limits automatic speech recognition (ASR) performance, as most models rely on monolingual or mixed-language corpora that fail to reflect real-world CS patterns. This issue is critical in multilingual societies where CS occurs in both informal and formal settings. A key example is Catalan-Spanish CS, widely used in media and parliamentary speeches. In this work, we improve ASR for Catalan-Spanish CS by exploring three strategies: (1) generating synthetic CS data, (2) concatenating monolingual audio, and (3) leveraging real CS data with language tokens. We extract CS data from Catalan speech corpora and fine-tune OpenAI's Whisper models, making them available on Hugging Face. Results show that combining a modest amount of synthetic CS data with the dominant language token yields the best transcription performance.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv: 2507.13875

Country:

Europe (0.47)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (0.34)

Industry:

Media (0.47)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)

Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling

Huang, Shao-Syuan, Huang, Kuan-Po, Liu, Andy T., Lee, Hung-yi

arXiv.org Artificial Intelligence, Dec-20-2024

Multilingual Automatic Speech Recognition (ASR) aims to recognize and transcribe speech from multiple languages within a single system. Whisper, one of the most advanced ASR models, excels in this domain by handling 99 languages effectively, leveraging a vast amount of data and incorporating language tags as prefixes to guide the recognition process. However, despite its success, Whisper struggles with unseen languages, those not included in its pre-training. Motivated by the observation that many languages share linguistic characteristics, we propose methods that exploit these relationships to enhance ASR performance on unseen languages. Specifically, we introduce a weighted sum method, which computes a weighted sum of the embeddings of language tags, using Whisper's predicted language probabilities. In addition, we develop a predictor-based approach that refines the weighted sum embedding to more closely approximate the true embedding for unseen languages. Experimental results demonstrate substantial improvements in ASR performance, both in zero-shot and fine-tuning settings. Our proposed methods outperform baseline approaches, providing an effective solution for addressing unseen languages in multilingual ASR.
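The weighted sum idea in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `weighted_sum_embedding`, the toy 4-dimensional embeddings, and the renormalization step are all assumptions made for the example. In practice the embeddings and probabilities would come from the model itself.

```python
from typing import List, Sequence


def weighted_sum_embedding(
    tag_embeddings: Sequence[Sequence[float]],
    lang_probs: Sequence[float],
) -> List[float]:
    """Mix language-tag embeddings using predicted language probabilities.

    Hypothetical helper: each row of tag_embeddings is the embedding of one
    seen language tag, and lang_probs holds the predicted probability of
    each of those languages for the current utterance.
    """
    if len(tag_embeddings) != len(lang_probs):
        raise ValueError("need exactly one probability per language tag")
    total = sum(lang_probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")

    dim = len(tag_embeddings[0])
    mixed = [0.0] * dim
    for embedding, prob in zip(tag_embeddings, lang_probs):
        weight = prob / total  # renormalize (assumption: input may be a top-k slice)
        for i, value in enumerate(embedding):
            mixed[i] += weight * value
    return mixed


# Toy example: three seen languages with made-up 4-dimensional embeddings.
toy_embeddings = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
toy_probs = [0.5, 0.3, 0.2]
mixed_embedding = weighted_sum_embedding(toy_embeddings, toy_probs)
```

The resulting vector would stand in for the single language-tag embedding when the spoken language has no tag of its own. The predictor-based refinement mentioned in the abstract is not shown here.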

artificial intelligence, machine learning, natural language, (15 more...)

arXiv: 2412.16474

Country: Asia > Taiwan (0.16)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

LCS: A Language Converter Strategy for Zero-Shot Neural Machine Translation

Sun, Zengkui, Liu, Yijin, Meng, Fandong, Xu, Jinan, Chen, Yufeng, Zhou, Jie

arXiv.org Artificial Intelligence, Jun-5-2024

Multilingual neural machine translation models generally distinguish translation directions by the language tag (LT) placed in front of the source or target sentences. However, current LT strategies cannot reliably indicate the desired target language in zero-shot translation, a failure known as the off-target issue. Our analysis reveals that the indication of the target language is sensitive to the placement of the target LT. For example, when the target LT is placed on the decoder side, the indication rapidly degrades as decoding proceeds, while placing the target LT on the encoder side leads to copying or paraphrasing the source input. To address these issues, we propose a simple yet effective strategy named Language Converter Strategy (LCS). By introducing the target language embedding into the top encoder layers, LCS mitigates confusion in the encoder and ensures stable language indication for the decoder. Experimental results on the MultiUN, TED, and OPUS-100 datasets demonstrate that LCS significantly mitigates the off-target issue, with language accuracy of up to 95.28%, 96.21%, and 85.35%, while outperforming the vanilla LT strategy by 3.07, 3.3, and 7.93 BLEU on zero-shot translation, respectively.
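The two vanilla LT placements contrasted in the abstract above can be illustrated with a small sketch. It shows only the baseline tagging schemes, not LCS itself, which operates on embeddings inside the encoder. The `<2xx>` tag format and the function name `add_language_tag` are assumptions chosen for the example.

```python
from typing import List, Sequence, Tuple


def add_language_tag(
    source_tokens: Sequence[str],
    target_tokens: Sequence[str],
    target_lang: str,
    placement: str,
) -> Tuple[List[str], List[str]]:
    """Prepend a target-language tag on the encoder or the decoder side.

    The "<2xx>" tag spelling is a hypothetical convention for this sketch.
    """
    tag = f"<2{target_lang}>"
    if placement == "encoder":
        # Tag leads the source sentence; the decoder input is unchanged.
        return [tag, *source_tokens], list(target_tokens)
    if placement == "decoder":
        # Tag leads the target sentence; the encoder input is unchanged.
        return list(source_tokens), [tag, *target_tokens]
    raise ValueError("placement must be 'encoder' or 'decoder'")


src = ["Good", "morning"]
tgt = ["Guten", "Morgen"]
enc_side = add_language_tag(src, tgt, "de", "encoder")
dec_side = add_language_tag(src, tgt, "de", "decoder")
```

With encoder-side placement the source becomes `<2de> Good morning`, and with decoder-side placement the target becomes `<2de> Guten Morgen`.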

computational linguistic, translation, zero-shot translation, (16 more...)

arXiv: 2406.02876

Country:

North America > Dominican Republic (0.04)
Asia > China > Beijing > Beijing (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
(7 more...)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Backdoor Attack on Multilingual Machine Translation

Wang, Jun, Xu, Qiongkai, He, Xuanli, Rubinstein, Benjamin I. P., Cohn, Trevor

arXiv.org Artificial Intelligence, Apr-2-2024

While multilingual machine translation (MNMT) systems hold substantial promise, they also have security vulnerabilities. Our research highlights that MNMT systems can be susceptible to a particularly devious style of backdoor attack, whereby an attacker injects poisoned data into a low-resource language pair to cause malicious translations in other languages, including high-resource languages. Our experimental results reveal that injecting less than 0.01% poisoned data into a low-resource language pair can achieve an average 20% attack success rate in attacking high-resource language pairs. This type of attack is of particular concern, given the larger attack surface of languages inherent to low-resource settings. Our aim is to bring attention to these vulnerabilities within MNMT systems with the hope of encouraging the community to address security concerns in machine translation, especially in the context of low-resource languages.

computational linguistic, language pair, translation, (14 more...)

arXiv: 2404.02393

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Ontario > Toronto (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(12 more...)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts

Nguyen, Xuan-Phi, Aljunied, Sharifah Mahani, Joty, Shafiq, Bing, Lidong

arXiv.org Artificial Intelligence, Jun-20-2023

Large language models (LLMs) are known to perform tasks effectively by simply observing a few exemplars. However, in low-resource languages, obtaining such hand-picked exemplars can still be challenging, and unsupervised techniques may be necessary. Moreover, competent generative capabilities of LLMs are observed only in high-resource languages, while their performance on under-represented languages falls behind due to pre-training data imbalance. To elicit LLMs' abilities in low-resource languages without any supervised data, we propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English. These prompts are then used to create intra-lingual exemplars to perform tasks in the target languages. Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages. We also show that fine-tuning a 7B model on data generated from our method helps it perform competitively with a 175B model. In non-English translation tasks, our method even outperforms supervised prompting by up to 3 chrF++ in many low-resource languages. When evaluated on zero-shot multilingual summarization, our method surpasses other English-pivoting baselines by up to 4 ROUGE-L and is also favored by GPT-4.

large language model, machine learning, translation, (18 more...)

arXiv: 2306.11372

Country:

North America > United States > Washington > King County > Seattle (0.14)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
Europe > Germany > Berlin (0.04)
(3 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Leveraging Language Identification to Enhance Code-Mixed Text Classification

Takawane, Gauri, Phaltankar, Abhishek, Patwardhan, Varad, Patil, Aryan, Joshi, Raviraj, Takalikar, Mukta S.

arXiv.org Artificial Intelligence, Jun-8-2023

The use of more than one language in the same text is referred to as code-mixing. Code-mixed data, especially English mixed with a regional language, is increasingly common on social media platforms. Existing deep-learning models do not take advantage of the implicit language information in code-mixed text. Our study aims to improve the performance of BERT-based models on low-resource code-mixed Hindi-English datasets by experimenting with language augmentation approaches. We propose a pipeline to improve code-mixed systems that comprises data preprocessing, word-level language identification, language augmentation, and model training on downstream tasks like sentiment analysis. For language augmentation in BERT models, we explore word-level interleaving and post-sentence placement of language information. We have examined the performance of vanilla BERT-based models and their code-mixed HingBERT counterparts on respective benchmark datasets, comparing their results with and without using word-level language information. The models were evaluated using metrics such as accuracy, precision, recall, and F1 score. Our findings show that the proposed language augmentation approaches work well across different BERT models. We demonstrate the importance of augmenting code-mixed text with language information on five different code-mixed Hindi-English downstream datasets based on sentiment analysis, hate speech detection, and emotion detection.
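The two augmentation schemes named in the abstract above, word-level interleaving and post-sentence placement, might look roughly like the following. This is a sketch under assumptions: the `EN`/`HI` label spellings, the `[SEP]` separator, and both function names are hypothetical, and the exact input format used in the paper may differ.

```python
from typing import Sequence


def interleave_language_tags(words: Sequence[str], langs: Sequence[str]) -> str:
    """Word-level interleaving: each word is followed by its language label."""
    if len(words) != len(langs):
        raise ValueError("need exactly one language label per word")
    pieces = []
    for word, lang in zip(words, langs):
        pieces.extend([word, lang])
    return " ".join(pieces)


def append_language_tags(
    words: Sequence[str], langs: Sequence[str], separator: str = "[SEP]"
) -> str:
    """Post-sentence placement: the label sequence follows the whole sentence."""
    if len(words) != len(langs):
        raise ValueError("need exactly one language label per word")
    return f"{' '.join(words)} {separator} {' '.join(langs)}"


# Toy Hindi-English sentence with hypothetical word-level language labels.
words = ["movie", "bahut", "acchi", "thi"]
langs = ["EN", "HI", "HI", "HI"]
interleaved = interleave_language_tags(words, langs)
post_sentence = append_language_tags(words, langs)
```

Here `interleaved` is `movie EN bahut HI acchi HI thi HI` and `post_sentence` is `movie bahut acchi thi [SEP] EN HI HI HI`. Either string would then be tokenized and passed to the classifier in place of the raw text.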

artificial intelligence, machine learning, natural language, (19 more...)

arXiv: 2306.04964

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Indonesia > Bali (0.04)
Asia > India > Maharashtra > Pune (0.04)
(9 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Exploring the Impact of Layer Normalization for Zero-shot Neural Machine Translation

Mao, Zhuoyuan, Dabre, Raj, Liu, Qianying, Song, Haiyue, Chu, Chenhui, Kurohashi, Sadao

arXiv.org Artificial Intelligence, May-16-2023

This paper studies the impact of layer normalization (LayerNorm) on zero-shot translation (ZST). Recent efforts for ZST often utilize the Transformer architecture as the backbone, with LayerNorm at the input of layers (PreNorm) set as the default. However, Xu et al. (2019) have revealed that PreNorm carries the risk of overfitting the training data. Based on this, we hypothesize that PreNorm may overfit supervised directions and thus have low generalizability for ZST. Through experiments on OPUS, IWSLT, and Europarl datasets for 54 ZST directions, we demonstrate that the original Transformer setting of LayerNorm after residual connections (PostNorm) consistently outperforms PreNorm by up to 12.3 BLEU points. We then study the performance disparities by analyzing the differences in off-target rates and structural variations between PreNorm and PostNorm. This study highlights the need for careful consideration of the LayerNorm setting for ZST.
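The structural difference between the two settings in the abstract above is only where LayerNorm sits relative to the residual connection. The sketch below shows both orderings on a plain list of floats. It is a simplified illustration: the learnable gain and bias of LayerNorm are omitted, and `toy_sublayer` is a made-up stand-in for an attention or feed-forward sublayer.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize to zero mean and unit variance (gain and bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """PreNorm: x + sublayer(LayerNorm(x)). The residual path is never normalized."""
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]


def post_norm_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """PostNorm: LayerNorm(x + sublayer(x)). The sum itself is normalized."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


def toy_sublayer(x: Vector) -> Vector:
    """Hypothetical stand-in for attention or a feed-forward network."""
    return [0.5 * v for v in x]


x = [1.0, 2.0, 3.0]
pre_out = pre_norm_block(x, toy_sublayer)
post_out = post_norm_block(x, toy_sublayer)
```

In the PreNorm block the input `x` passes through the residual path untouched, while the PostNorm output always has roughly zero mean and unit variance regardless of the scale of `x`.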

artificial intelligence, machine learning, natural language, (17 more...)

arXiv: 2305.09312

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
North America > Canada (0.04)
(13 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

MMT: A Multilingual and Multi-Topic Indian Social Media Dataset

Dalal, Dwip, Srivastava, Vivek, Singh, Mayank

arXiv.org Artificial Intelligence, Apr-2-2023

Social media plays a significant role in cross-cultural communication. A vast amount of this communication occurs in code-mixed and multilingual form, posing a significant challenge to Natural Language Processing (NLP) tools for tasks such as language identification, topic modeling, and named-entity recognition. To address this, we introduce a large-scale multilingual and multi-topic dataset (MMT) collected from Twitter (1.7 million tweets), encompassing 13 coarse-grained and 63 fine-grained topics in the Indian context. We further annotate a subset of 5,346 tweets from the MMT dataset with various Indian languages and their code-mixed counterparts. We also demonstrate that existing tools fail to capture the linguistic diversity in MMT on two downstream tasks, i.e., topic modeling and language identification. To facilitate future research, we will make the anonymized and annotated dataset available in the public domain.

artificial intelligence, natural language, text processing, (20 more...)

arXiv: 2304.00634

Country:

Asia > Middle East > Jordan (0.05)
Asia > India > Gujarat > Gandhinagar (0.05)
Asia > India > West Bengal (0.04)
(2 more...)

Genre: Research Report (0.65)

Industry:

Leisure & Entertainment (1.00)
Media > Film (0.47)
Information Technology > Services (0.47)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.86)

Naming Languages - bryandragon.com

#artificialintelligence, Jul-9-2021, 15:54:29 GMT

As part of the Novetta Mission Analytics team, I work on a data pipeline that ingests traditional and social media from around the world, enriches it, and makes the enriched data available to customers. Enrichment can involve any number of steps, many of them powered by machine learning, and one of the earliest and most common steps is translation. When new content arrives, the source language is often unknown and must be detected; if the source language is different from the target language, the content is also translated. In order to translate this volume of content automatically, accurately, and cost-effectively, we rely on multiple cloud translation services. To the surprise of no one, cloud translation services differ not only in pricing but also in the languages they support and in the quality of translation across them. It's often most cost-effective to perform language detection with one service and, depending on the detected language, translation with another. In addition, these services occasionally use different identifiers to refer to the same language, which requires us to do some mapping on our end.
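The identifier mapping described above could start from something as small as the sketch below. It is an illustration under stated assumptions, not the pipeline's actual code: the function names are hypothetical, only the primary language subtag is compared (script and region subtags are dropped), and the table covers just three well-known deprecated ISO 639 codes.

```python
# Deprecated two-letter codes and their current replacements.
DEPRECATED_CODES = {
    "iw": "he",  # Hebrew
    "in": "id",  # Indonesian
    "ji": "yi",  # Yiddish
}


def normalize_language_code(code: str) -> str:
    """Reduce a service-specific identifier to a canonical primary subtag.

    Hypothetical helper: accepts "_" or "-" as the separator, lowercases
    the primary subtag, and replaces deprecated codes. Script and region
    subtags (for example "Hans" or "BR") are deliberately discarded.
    """
    primary = code.strip().replace("_", "-").split("-")[0].lower()
    return DEPRECATED_CODES.get(primary, primary)


def same_language(code_a: str, code_b: str) -> bool:
    """Return True when two identifiers share a canonical primary subtag."""
    return normalize_language_code(code_a) == normalize_language_code(code_b)


detected = "iw"      # identifier as one service might report it
supported = "he-IL"  # identifier as another service might expect it
can_route = same_language(detected, supported)
```

Dropping script and region subtags is a lossy choice. A pipeline that must tell Simplified from Traditional Chinese, for instance, would need to keep and compare those subtags as well.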

language subtag, language tag, subtag, (14 more...)

Country:

Asia > Taiwan (0.05)
South America > Brazil (0.04)
North America > United States (0.04)
(5 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language (0.35)

All that is English may be Hindi: Enhancing language identification through automatic ranking of likeliness of word borrowing in social media

Patro, Jasabanta, Samanta, Bidisha, Singh, Saurabh, Basu, Abhipsa, Mukherjee, Prithwish, Choudhury, Monojit, Mukherjee, Animesh

arXiv.org Artificial Intelligence, Jul-29-2017

In this paper, we present a set of computational methods to identify the likeliness of a word being borrowed, based on signals from social media. In terms of Spearman correlation coefficient values, our methods perform more than two times better (nearly 0.62) in predicting the borrowing likeliness than the best-performing baseline (nearly 0.26) reported in the literature. Based on this likeliness estimate, we asked annotators to re-annotate the language tags of foreign words in predominantly native contexts. In 88 percent of cases the annotators felt that the foreign language tag should be replaced by the native language tag, indicating considerable scope for improving automatic language identification systems.

ground truth, rank list, tweet, (16 more...)

doi: 10.18653/v1/D17-1240

arXiv: 1707.08446

Country:

Asia > Indonesia > Bali (0.05)
Asia > Macao (0.04)
Asia > China > Hong Kong (0.04)
(5 more...)

Genre:

Research Report (0.64)
Questionnaire & Opinion Survey (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
